---
title: |
  | \vspace{1.5in} Visual Heuristics for Marginal Effects Plots \vspace{1in}
abstract: Common visual heuristics used to interpret marginal effects plots are susceptible to Type-1 error. This susceptibility varies as a function of (1) sample size, (2) stochastic error in the true data generating process, and (3) the relative size of the main effects of the causal variable versus the moderator. I discuss simple alternatives to these standard visual heuristics that may improve inference and do not depend on regression parameters.

output:
  pdf_document:
    fig_caption: yes
    includes:
      in_header: preamble-latex.tex
      after-body: appendix.Rmd
    latex_engine: xelatex
    keep_tex: true
fontsize: 12pt
mainfont: Palatino
bibliography: references.bib

---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, fig.width=6, fig.height=3)
library(knitcitations); cleanbib()
cite_options(citation_format = "pandoc", hyperlink="to.bib", check.entries=FALSE)
library(bibtex)
library(interflex)
library(Lmoments)
library(doParallel)
library(reshape2)
library(ggplot2)
library(truncnorm)
setwd("___EDIT HERE___")
source("marginaleffects.line.R")
```

\pagenumbering{gobble} 

\newpage
\pagenumbering{arabic} 


## Introduction

The interpretation of interaction terms in political science is a topic of wide interest [@braumoeller2004; @bramboretal2006; @kamfranzese2007; @berryetal2012; @esareysumner; @hainmuelleretal]. An influential article by @bramboretal2006, in particular, has transformed how political scientists study and interpret interative hypotheses.^[At the time of writing, Google Scholar counts 3781 citations for @bramboretal2006, and another 699 for @braumoeller2004, which presented a similar argument at about the same time.] In addition to reminding researchers that they must include both constitutive terms and interaction terms if they wish to test interactive hypotheses, the authors write that "The analyst cannot ... infer whether X has a meaningful conditional effect on Y from the magnitude and significance of the coefficient on the interaction term either... It means that one cannot determine whether a model should include an interaction term simply by looking at the significance of the coefficient on the interaction term" (p. 74). The authors propose instead to use ``marginal effects plots'' to calculate the estimated marginal effect of the variable of interest across substantively meaningful values of the moderating variable.

Marginal effects plots have since become ubiquitous in political science. Despite their ubiquity, there is little analysis of their performance as a tool for identifying interactive effects. Two recent papers have begun to look more closely at the marginal effects plot. @hainmuelleretal show that marginal effects plots (and indeed, any hypothesis test relying on an interaction term) rely on the assumption that the effect of the causal variable of interest is linear and constant across the values of the moderating variable. By contrast, @esareysumner argue that marginal effects plots usually have inappropriate coverage because of the problem of multiple comparisons. My contribution in this manuscript is to draw attention to the visual heuristics that researchers implicitly use when they interpret marginal effects plots. 

In this article I demonstrate that applied researchers have drawn incorrect conclusions from @bramboretal2006. The appropriate test for the presence of linear interaction effects *is* given by the significance of the coefficient on the interaction term. Commonly used visual heuristics, which I identify below, will often fail compared to this test. Marginal effects plots have other uses, but they should not be used to test for the presence of linear interaction effects. 

Focusing on Type-1 error, or the problem of false positives, I ask the following question: *how frequently will visual inspection of a marginal effects plot suggest that interaction effects exist when the true data generating process is not interactive?* To investigate, I generate simulated data for a binary treatment variable $D$ and an additional predictor $X$. The true data generating process is $Y = \alpha + \beta_1D + \beta_2X + \epsilon$, where $\alpha=\beta_1=\beta_2=1$. Here, $X$ does not moderate the effect of $D$ on $Y$. I then used a marginal effects plot to test whether the effect of $D$ varies across values of $X$ by estimating the regression $Y = a + b_1D + b_2X + b_3X \cdot D$. I use the \texttt{inter.binning} command in the \texttt{interflex} package in @hainmuelleretal, which produces marginal effects plots as well as what they term a "binning estimator" that allows for the effect of $D$ on $Y$ to be nonlinear in $X$. The results are in Figure \ref{fig:margDY}.

```{r setup.example, echo=FALSE, results="hide", warning=FALSE}
#set.seed(14850)
#n <- 300
#error <- 4

#d <- rbinom(n,1,.5)
#x <- rnorm(n)
#e <- rnorm(n,0,error)
#y <- 1 + d + x + e


set.seed(14850)
n <- 250
error <- 4

D <- rbinom(n,1,.5)
X <- rnorm(n)
e <- rnorm(n,0,error)
Y <- 1 + D + X + e

fit <- lm(Y~D*X)

```

```{r example.plot, echo=FALSE, results="hide", warning=FALSE, fig.cap="The Marginal Effect of D on Y\\label{fig:margDY}",}
#s <- cbind.data.frame(D=d, X=x, Y=y)
s <- cbind.data.frame(D=D, X=X, Y=Y)
p <- inter.binning(Y = "Y", D = "D", X = "X", data = s, vartype = "robust")
p$graph + theme_classic() + geom_hline(yintercept=0)
```

\noindent Visual inspection of Figure \ref{fig:margDY} suggests that the effect of $D$ on $Y$ is positive at high values of $X$, and is indistinguishable from zero at low values of $X$. (The same conclusion emerges from a visual inspection of the binning estimator as well.) In practice, it is common for empirical researchers to conclude from marginal effects plots of this sort that the effect of $D$ on $Y$ depends on the value of $X$, and to reject the null hypothesis that there is no interactive effect between $X$ and $D$.^[Here is the relevant text from @bramboretal2006, who discuss a nearly identical plot: "confidence intervals around the line allow us to determine the conditions under which presidential elections have a statistically significant effect on the number of electoral parties---they have a statistically significant effect whenever the upper and lower bounds of the confidence interval are both above (or below) the zero line. It is easy to see that temporally proximate presidential elections have a strong reductive effect on the number of electoral parties when there are few presidential candidates. As predicted, this reductive effect declines as the number of presidential candidates increases. Once there are more than 2.9 effective presidential candidates, presidential elections no longer have a significant reductive impact on legislative fragmentation," p. 76.] 

That conclusion is incorrect. The coefficient on the interaction term, $b_3$, *is the* test of whether the effect of $D$ on $Y$ depends on the value of $X$ [see @kamfranzese2007, p. 50]. In the above example, we know the true data generating process, so we know that the effect of $D$ on $Y$ does not depend on the value of $X$. And indeed, the p-value on $b_3$ is `r round(summary(fit)$coefficients[4,4], digits=3)`, which does not to reject the null hypothesis that the effect of $D$ does not depend on $X$ at the 95% confidence level. In the remainder of this article I clarify what a marginal effects plot tests and how this is different from the hypothesis that the effect of $D$ on $Y$ varies by $X$; illustrate the consequences of using marginal effects plots to test the latter hypothesis; recapitulate standard recommendations about how to test interactive hypotheses; and propose a more informative marginal effects plot that discourages their misuse.


## Learning from Marginal Effects Plots
Marginal effects plots contain two pieces of information. The first is the slope of the "marginal effect line," which is determined by the coefficient $b_3$. The second is the width of the confidence intervals, which depend on the estimated variances and covariances between $b_1$ and $b_3$. From these pieces of visual information, the researcher makes inferences about the data generating process using heuristics: is the slope positive or negative, is the slope "large" or "small," does the line cross zero, and for what range of values do its confidence intervals include zero?

When there is no interactive effect, the true value of $b_3$ is zero. However, estimating a regression with an interaction term will produce non-zero estimates of $b_3$; *in expectation* these estimates will be zero, but in any application they will almost always indicate a non-zero slope for the marginal effects line. Regression with an interaction term will likewise produce non-zero estimates of the variance-covariance matrix, including the covariance of the non-existing interaction effect and the main effects.

Figure \ref{fig:margDY_perfect} illustrates how such a marginal effects plot *ought* to look when there is no interaction between $X$ and $D$. 

```{r setup.good.example, results="hide", echo=FALSE, warning=FALSE, fig.cap="The Marginal Effect of D on Y\\label{fig:margDY_perfect}"}
set.seed(14850)
n <- 100000
error <- 1

d <- rbinom(n,1,.5)
x <- rnorm(n)
e <- rnorm(n,0,error)
y <- d + x + e

s <- cbind.data.frame(D=d, X=x, Y=y)
p <- inter.binning(Y = "Y", D = "D", X = "X", data = s, vartype = "robust")
p$graph + theme_classic() + scale_y_continuous(limits = c(.8, 1.1))
```

\noindent Because the effect of $D$ does not depend on $X$, the line is flat across values of $X$, and the confidence intervals are bounded away from zero. To generate such a clean visual result, however, I had to set the sample size to 100,000 and the variance of $\epsilon$ to 1.

When visual results are not so clean, researchers commonly follow one of two visual heuristics. The first, which I term the "crosses zero" heuristic, looks to see whether or not the confidence intervals capture zero for some portion of the range of $X$, and do not for some other range of $X$. If so, the inference is then that *the effect of $D$ on $Y$ is nonzero for some range of $X$, and zero elsewhere.* The second, which I term the "compare extremes" heuristic, looks to see whether there is overlap across the entire range of the confidence band. If not, the inference is then *the effect of $D$ on $Y$ differs across the values of $X$.* The range of the confidence band depends on the range of values of $X$ across which marginal effects are calculated; for purposes of this discussion, I consider the relevant range to be the observed range of $X$ in the sample.^[This is the default option in many software packages that automate the creation of marginal effects plots, such as @hainmuelleretal's \texttt{interflex} and @solthu's \texttt{interplot}.] To be clear, these two heuristics amount to tests of different hypotheses. The compare extremes estimator is already subject to critique because if the extremes of the marginal effects plot include either values that lie beyond the area of common support, then inferences are particularly fragile [see e.g. @hainmuelleretal, p. 3]. To my knowledge, the crosses zero heuristic has never been identified, but is widely used. 

One other piece of information that may test of whether or not an interaction effect exists is the coefficient $b_3$. If $b_3$ is small and statistically indistinguishable from zero, this would be evidence against the hypothesis that the marginal effect of $D$ depends on $X$. However, researchers often hold that the coefficient $b_3$ is not a test of whether interaction effects exist, because that coefficient does not provide information about the marginal effect $D$ on $Y$ at various levels of $X$, which is usually the quantity of interest. Both the crosses zero heuristic and the compare extremes heuristic may be interpreted as attempts to better study the marginal effects of $D$ across values of $X$.

We are left with what seems to be an impasse. The coefficient on the interaction term is not a meaningful test of the marginal effect $D$ on $Y$ at various levels of $X$. And yet it proved very easy to develop a case where standard heuristics based on a marginal effects plot produced misleading conclusions that interaction effects do exist. The solution is to recognize that the following two statements are entirely consistent with one another: 

1. the coefficient of $b_3$ expresses the term $\frac{\partial(\partial Y/\partial D)}{\partial X}$, or "does the effect of $D$ differ across the values of $X$?"

2. the confidence intervals around the point estimates across values of $X$, as found in a marginal effects plot, are the value-by-value correct confidence intervals for each of those conditional effects of $D$ on $Y$. Each expresses the term $\partial Y/\partial D|X$, or "what is the effect of $D$ on $Y$ when $X = x$?"

\noindent Differences in $\partial Y/\partial D|X$ across values of $X$ cannot be easily translated into evidence about whether the effects of $D$ depend on $X$. For present purposes, however, the key is that the crosses zero heuristic *does not* generally translate the latter into the former. An extensive review of the specific hypotheses tested by various interaction models can be found in @kamfranzese2007, pp. 43--92.

How frequently do researchers employ the crosses zero heuristic? I consulted each of the articles replicated by @hainmuelleretal and checked for evidence that authors explicitly based their inferences on a marginal effects plot *rather than* on the statistical significance of the interaction term. The authors argue that this sample represents "high profile" articles that likely took "special care to employ and interpret these models correctly". By my count, 7 of 22 or nearly one out of every three articles fulfill this criterion.^[Here are two examples. First, from @hellwigsamuels2007, pp. 293--294: "it is not possible to infer the extent to which globalization affects electoral accountability directly from the unconditional coefficients and standard errors from the interaction models in Table 1. Although the coefficient on the Economy $\times$ Capital Flows term is not statistically significant, the reported standard error pertains only to two specific combinations of values: the marginal effect of Economy when Capital Flows equals 0 or the marginal effect of Capital Flows when Economy equals 0. Figures 1 and 2 better illustrate the degree to which exposure to the global economy conditions the effect of economic performance on election...Once exposure to international trade exceeds 77% of GDP, the positive effects of economy on incumbent vote are no longer statistically significant, as shown by the 95% confidence-interval bands." Next, from @bodeahicks2015, p. 278: "In Model 9, we find a statistically significant and correctly signed (positive) CBI index, with a further positive coefficient of the interaction of CBI and Polity. Interaction terms and their components are, however, difficult to interpret, and Brambor, Clark, and Golder (2006) prescribe that for multiplicative interaction models, inference should be done with meaningful marginal effects and standard errors to determine the conditions under which key variables have a statistically significant effect. We plot the marginal effect of CBI as democracy increases in non-OECD countries in Figure 1(a). The marginal effect is significant only at high levels of Polity, supporting Hypothesis 2.1."] In the majority of the remaining 15 cases, the coefficient on the interaction term was itself significant, obviating the need to choose one or the other. For the same reasons that @hainmuelleretal argue that their replications represent a lower bound on the true rate of problematic multiplicative interaction terms, my count may also represent a lower bound on how often visual heuristics are used to identify interaction effects. 


## Simulations
The preceding discussion explains why it is not correct to compare marginal effects to make inferences about the *presence of* interactive effects. To illustrate the dangers of doing so, I use simulations. Based on the data generating process outlined in the introduction, I created 1000 simulated datasets and created "virtual" marginal effects plots for each. I then implemented five tests: three based on the heuristics outlined above, one based on the coefficients from the binning estimator, and one based on the coefficient on $b_3$. 

* *Crosses zero heuristic* If the estimated marginal effect of $D$ on $Y$ is both statistically distinguishable from zero across at least 25% of the range of $X$ and indistinguishable from zero across at least 25% of the range of $X$---analogous to Figure \ref{fig:margDY_perfect}---I conclude that the marginal effects plot is consistent with the presence of an interactive effect. I implement this test by checking if either of the following conditions hold: the confidence interval of the 25th (75th) percentile of $X$ captures zero, the confidence interval of the 75th (25th) percentile of $X$ excludes zero, and the confidence interval of the 97.5th (2.5th) percentile of $X$ excludes zero.^[This third condition ensures that the interaction effect looks "directional" rather than being essentially flat with the confidence intervals excluding zero only at the midpoint of the range of $X$.] Note that this implementation of the crosses zero heuristic is fairly conservative in requiring statistical insignificance across at least an entire quartile of $X$. If I had decreased this requirement to a quintile or a decile, my conclusions would be stronger.
    
* *Crosses zero heuristic (bins)* This heuristic applies the same logic of the crosses zero heuristic to a plot derived from the binning estimator. If the confidence interval of the low (high) tercile captures zero, and the confidence interval of the high (low) tercile does not capture zero, and the point estimate for each tercile falls in the order *(Low, Middle) > High* or *(High, Middle) > Low*, then I conclude that the binning estimator plot is consistent with the presence of an interactive effect.

* *Compare extremes heuristic* If the maximum value of the lower confidence interval is greater than the minimum value of the upper confidence interval, then I conclude that that marginal effects plot is consistent with the presence of an interactive effect. Recognizing the critiques that exist of this heuristic, note here that I study only cases where the confidence band extends to the observed maximum and minimum of a normally distributed moderator whose values are independent of the causal variable.

* *Differences between bins* If the two-sided p-value for a test of the equality of the first and third bins is less than .05, I conclude that the binning estimates are consistent with the presence of an interactive effect.

* *Coefficient and p-value* If the p-value associated with $b_3$ is less than .05, then I conclude that a standard regression-based approach is consistent with the presence of an interactive effect.

\noindent I then repeat this process hundreds of times, varying four parameters: the sample size $n$, the variance of $\epsilon$, the ratio of $\beta_1$ to $\beta_2$, and $\alpha$. The results appear below.

First, I fix $V(\epsilon) = 4, \beta_1 = 1, \beta_2 =1$, and $\alpha = 1$, and then vary sample size from $n = 100$ to 2000. There is clear evidence that with a sample size of 1000 or less, the crosses zero heuristic based on a marginal effects plot is overconfident relative to regression coefficients about the presence of an interactive effect. The performance of the same visual heuristic applied to the binning estimator is even worse. The performance of regression coefficients, a formal test of the differences between bins, and the compare extremes heuristic are all invariant to sample size.

```{r simulation.n, echo=FALSE, fig.cap="Type-1 Error Rates for Four Different Heuristics\\label{fig:vary.n}"}
plotcols<-c("black","black","red","red","red")
vary.n <- read.csv("vary.n")
vary.n.melt <- melt(vary.n, id.vars = c("n","ar2"), 
                    measure.vars = c("false.plot", "false.bin", "false.plot2", "false.bin2", "false.p"))
p <- ggplot(data=vary.n.melt, aes(x=n, y=value, group=variable, linetype=variable, color=variable)) + 
  geom_line(size=.7) 
p + geom_hline(yintercept=0.05) + 
  scale_colour_manual("", values=plotcols, labels=c("Crosses Zero Heuristic (ME Plot)", 
                                           "Crosses Zero Heuristic (Bins)",
                                           "Compare Extremes Heuristic",
                                           "Difference between Bins",
                                           "Coefficient and p-value")) +
  scale_linetype_manual("", values=c(1,2,3,2,1),  labels=c("Crosses Zero Heuristic (ME Plot)", 
                                                  "Crosses Zero Heuristic (Bins)",
                                                  "Compare Extremes Heuristic",
                                                  "Difference between Bins",
                                                  "Coefficient and p-value")) +
  theme_classic() + labs(x = "N", y="Probability of Type-1 Error")
```

\noindent The small sample performance of the cross-zero heuristic in Figure \ref{fig:vary.n} is noteworthy because it runs counter to common expectations that small samples lead to conservative tests that are more likely to fail to reject the null when an alternative hypothesis is true. In identifying interaction effects, the crosses zero heuristic is anticonservative in small samples.

In the Appendix I vary other features of the simulations. Specifically, I vary the unexplained variance in the model ($V(\epsilon)$), the ratio of $\beta_1$ and $\beta_2$, and the value of $\alpha$. Taken together, the results provide further evidence of how the crosses zero heuristic increases the likelihood of Type-1 error.

## Discussion

The crosses zero heuristic is overconfident when interpreted to be a test of the hypothesis that the effect of $D$ varies across the range of $X$ because it is sometimes statistically distinguishable from zero. That overconfidence, moreover, depends on features of the regression such as sample size, the relative size of $\beta_1$ to $\beta_2$, and model error. The binning estimator is particularly useful for detecting nonlinear interaction effects, but if researchers apply the same visual heuristic when interpreting the plots derived from the binning estimator, they will be even more prone to uncover false interactive effects. On the other hand, the power of both the compare extremes heuristic and the coefficient on the interaction term is that their performance does not depend on sample size, stochastic variance, or the size of the causal effect of interest. The problem is that they themselves are not meaningful tests of any substantive hypothesis unless both $X$ and $D$ happen to be binary. Might it be preferable, then, to condition any inferences on the statistical significance of the interaction term? Knowing the answer depends on not just Type-1 error rates, but also Type-2 error rates. 

To explore this, I adjust the data generating process to $Y = \alpha + \beta_1D + \beta_2X + \beta_3X\cdot D + \epsilon$, where $\alpha=\beta_1=\beta_2=\beta_3=1$. In this case, the true marginal effect of $D$ on $Y$ is 0 when $X=-1$; in the simulations below, I truncate the distribution of $X$ at -1.5 to reflect a situation where the effect of $D$ on $Y$ is zero at the lowest values of $X$, and positive at higher values of $X$. Marginal effects plots are appropriate in this case because the effects of $D$ on $Y$ are constant in $X$ by construction. An example appears in Figure \ref{fig:example.le}, with $X=-1$ highlighted.

```{r setup.example.le, echo=FALSE, results="hide", warning=FALSE}
set.seed(14850)
n <- 300
error <- 4

d <- rbinom(n,1,.5)
x <- rtruncnorm(n, a=-1.5)
e <- rnorm(n,0,error)
y <- 1 + d + x + x*d + e
```
```{r plot.example.le, echo=FALSE, warning=FALSE, fig.cap="A Linear Interaction Effect\\label{fig:example.le}"}
s <- cbind.data.frame(D=d, X=x, Y=y)
p <- inter.binning(Y = "Y", D = "D", X = "X", data = s, vartype = "robust")
p$graph + theme_classic() + geom_hline(yintercept=0) + 
  geom_vline(xintercept=-1, linetype = "longdash")
```

\noindent I then test the performance of each of the five heuristics. To "stack the deck" in favor of the crosses zero heuristic, I only require that the confidence interval includes zero at $X=-1$, and that it excludes zero at the 75th percentile of $X$. In these simulations, I fix $V(\epsilon) = 4, \beta_1 = \beta_2 = \beta_3 = 1$, and $\alpha = 1$.

```{r simulation.n.le, echo=FALSE, fig.cap="Type-2 Error Rates for Four Different Heuristics\\label{fig:vary.n.le}"}
vary.n <- read.csv("vary.n.le")
vary.n.melt <- melt(vary.n, id.vars = c("n","ar2"), 
                    measure.vars = c("false.plot", "false.bin", "false.plot2", "false.bin2", "false.p"))
p <- ggplot(data=vary.n.melt, aes(x=n, y=value, group=variable, linetype=variable, color=variable)) + 
  geom_line(size=.7) 
p + geom_hline(yintercept=0.05) + 
  scale_colour_manual("", values=plotcols, labels=c("Crosses Zero Heuristic (ME Plot)", 
                                           "Crosses Zero Heuristic (Bins)",
                                           "Compare Extremes Heuristic",
                                           "Difference between Bins",
                                           "Coefficient and p-value")) +
  scale_linetype_manual("", values=c(1,2,3,2,1),  labels=c("Crosses Zero Heuristic (ME Plot)", 
                                                  "Crosses Zero Heuristic (Bins)",
                                                  "Compare Extremes Heuristic",
                                                  "Difference between Bins",
                                                  "Coefficient and p-value")) +
  theme_classic() + labs(x = "N", y="Probability of Type-2 Error")
```

\noindent Figure \ref{fig:vary.n.le} shows that with small sample sizes, all five heuristics are likely to fail to reject the null hypothesis that there is no interaction effect when one does exist. As sample size increases, all five heuristics improve, but the crosses zero heuristic based on the marginal effects plot improves the fastest. The crosses zero heuristic applied to the binning estimator is acceptable but too conservative even with large samples. The formal test of the differences between bins only approaches the performance of the other four heuristics when the sample size is large. These results suggest that marginal effects plots are better suited than coefficients and p-values for identifying interaction effects, but only when we know that these effects exist and the sample size is relatively small. Similar conclusions may be drawn from simulations that increase the ratio of stochastic variance to systematic variance. 

Finally, I consider a case where the effects of $D$ on $Y$ are nonlinear in $X$. Specifically, I investigate a data generating process with the following form:

* where $X < -.5$, $Y = \alpha + \beta X + \epsilon$
* where $X \geq -.5$, $Y = \alpha + 2*(X+.5)^2*D + \beta X + \epsilon$

\noindent $D$ has no effect on $Y$ when $X<-0.5$, but when $X\geq -0.5$ the effect of $D$ increases as a nonlinear function of $X$ itself---the effect of $D$ is small when $X$ is small, and large when $X$ is large. In Figure \ref{fig:example.nle}, I present both the standard marginal effects plot and the kernel estimator plots from @hainmuelleretal, highlighting the range of the data $(-1.5 < X < -0.5)$ where the effect of $D$ is zero.

```{r setup.example.nle, echo=FALSE, results="hide", warning=FALSE}
set.seed(14850)
n <- 300
error <- 4

d <- rbinom(n,1,.5)
x <- rtruncnorm(n, a=-1.5)
e <- rnorm(n,0,error)
y <- rep(NA,n)
y[x<(-.5)] <- 0*d[x<(-.5)] + x[x<(-.5)] + e[x<(-.5)]
y[x>=(-.5)] <- 2*((x[x>=(-.5)]+.5)^2)*d[x>=(-.5)] + x[x>=(-.5)] + e[x>=(-.5)]
```
```{r plot.example.nle, echo=FALSE, results="hide", warning=FALSE, fig.cap="A Nonlinear Interaction Effect\\label{fig:example.nle}", fig.width=3, fig.height=3,fig.show='hold',fig.align='center'}
s <- cbind.data.frame(D=d, X=x, Y=y)
p <- inter.binning(Y = "Y", D = "D", X = "X", data = s, vartype = "robust")
p$graph + theme_classic() + geom_hline(yintercept=0) + 
  geom_vline(xintercept=-.5, linetype = "longdash") + geom_vline(xintercept=-1.5, linetype = "longdash")
p <- inter.kernel(Y = "Y", D = "D", X = "X", data = s, nboots=200)
p$graph + theme_classic() + geom_hline(yintercept=0) + 
  geom_vline(xintercept=-.5, linetype = "longdash") + geom_vline(xintercept=-1.5, linetype = "longdash")
```

\noindent Not surprisingly, the kernel estimator captures the nonlinear effect of $D$ on $Y$ better than does the marginal effects plot, which indicates a negative and statistically significant marginal effect for $D$ at the low range of $X$. I then test the performance of two of the five heuristics in capturing the "true" interactive effect in Figure \ref{fig:vary.n.nle}.^[The compare extremes heuristic, the difference betweem bins test, and the coefficients and p-values always indicate an interactive effect, but these are by design not informative of whether the effect of $D$ is zero in some range of $X$ and positive elsewhere.] In these simulations, I fix $V(\epsilon) = 4$ and $\alpha = 1$.

```{r simulation.n.nle, echo=FALSE, fig.cap="Detecting Nonlinear Interactions for Two Different Heuristics\\label{fig:vary.n.nle}"}
vary.n <- read.csv("vary.n.nle")
vary.n.melt <- melt(vary.n, id.vars = c("n","ar2"), 
                    measure.vars = c("false.plot", "false.bin"))
p <- ggplot(data=vary.n.melt, aes(x=n, y=value, group=variable, linetype=variable, color=variable)) + 
  geom_line(size=.7) 
p + geom_hline(yintercept=0.95) + scale_colour_manual("", values=c("black","black"),
                                                      labels=c("Crosses Zero Heuristic (ME Plot)", 
                                                          "Crosses Zero Heuristic (Bins)")) +
   scale_linetype_manual("", values=c(1,2),  labels=c("Crosses Zero Heuristic (ME Plot)", 
                                                  "Crosses Zero Heuristic (Bins)")) +
  scale_y_continuous(limits = c(0, 1)) +
  theme_classic() + labs(x = "N", y="Prob. of Detecting 'True' Effect")
```

\noindent In these simulations, the crosses zero heuristic nearly always fails to identify the correct nonlinear effect of $D$ on $Y$ in the marginal effects plot, because it shows that the marginal effect of $D$ on $Y$ is *negative and significant* at low values of $X$. The binning estimator, on the other hand, has almost a 95% chance of detecting the true nonlinear relationship between $D$ and $Y$. 

## Recommendations
This article has shown that visual heuristics used to interpret marginal effects plots can lead to misleading substantive conclusions. When there is no interaction between $D$ and $X$, the crosses zero heuristic is likely to identify a relationship that does not exist.  Relative to a simple inspection of the coefficient on the interaction term, marginal effects plots are thus overconfident. When linear interaction effects *do* exist, marginal effect plots accurately capture the substantive quantities of interest.^[@hainmuelleretal's binning estimator fares best in the presence of nonlinear interaction effects, which is the exact purpose for which it was proposed.] 

@bramboretal2006 most important contribution---amplified by @braumoeller2004 and @kamfranzese2007 in ways that have fundamentally changed research practice---is to shift researchers away from simple inspection of coefficients and standard errors when examining substantive interaction effects. However, @bramboretal2006's argument that "one cannot determine whether a model should include an interaction term simply by looking at the significance of the coefficient on the interaction term" is incorrect if interpreted to mean that the coefficient on the interaction term does not test whether the effect of $D$ differs across the values of $X$. Using interaction plots to test for the presence of interactive effects is a mistake. 

This discussion suggests some simple guidelines for applied researchers. Assuming linear interaction effects, a conservative strategy that minimizes Type-1 error and which does not depend on sample size, stochastic error, or the relative size of the causal effect of interest would be to only use coefficients and p-values to test for the presence of interaction effects. Although marginal effects plots do calculate the correct marginal effects and their confidence intervals, they do not test for the presence or absence of an interactive effect. Marginal effects plots should be used, then, only to calculate substantive quantities of interest. They are also useful, in combination with histograms of the distribution of the moderating variable, to explore the sensitivity of interaction models to the range of the moderating variable.

Another strategy to improve standard visual heuristics is to add a second reference line that corresponds to the marginal effect of $D$ evaluated at the median of $X$, as in Figure \ref{fig:margDY.xtraline}. This line exploits the properties of the compare extremes heuristic, which I demonstrated above to perform about as well as do tests of the significance of the interaction term.

```{r example.plot.xtraline, echo=FALSE, results="hide", warning=FALSE, fig.cap="The Marginal Effect of D on Y\\label{fig:margDY.xtraline}", fig.width=3, fig.height=3,fig.show='hold',fig.align='center'}
set.seed(14850)
n <- 250
error <- 4

D <- rbinom(n,1,.5)
X <- rnorm(n)
e <- rnorm(n,0,error)
Y <- 1 + D + X + e
fit <- lm(Y~D*X)
p <- interplot.medline(m=fit, var1="D",var2="X")
p + labs(x = expression('Moderator: X, b'[3]*' = 0'))

Y <- 1 + D + X + .5*D*X + e
fit2 <- lm(Y~D*X)
p <- interplot.medline(m=fit2, var1="D",var2="X")
p + labs(x = expression('Moderator: X, b'[3]*' = 0.5'))
```

\noindent This additional dotted line focuses the eye not only on whether the confidence band includes zero, but also on whether the entire confidence band spans a common value. The figure on the left plots the same model as in Figure \ref{fig:margDY}, and clearly reveals no interaction effect when $b_3=0$. The figure on the right, generated with $b_3=0.5$, reveals the appropriate interactive relationship. In combination with the histogram at the bottom of each plot, it is possible as well to inspect whether inferences depend on a few extreme values of $X$. If so, the methods proposed by @hainmuelleretal are particularly useful.

I provide open source software in \texttt{R} to create figures similar to Figure \ref{fig:margDY.xtraline} in the R package \texttt{interplot.medline}, which is based on the \texttt{interplot} package in \texttt{R} by @solthu.^[The package may be installed directly from GitHub following instructions available at https://tompepinsky.com/research/code/.] This simple addition to the standard marginal effects plot should discourage researchers from inferring that interaction effects exist when they do not. 


## References

<div id="refs"></div>
\newpage

## Appendix

In this Appendix, I present the results from additional simulations. 

First, I fix $n = 500, \beta_1 = 1, \beta_2 =1$, and $\alpha = 1$, and then vary $V(\epsilon)$ from 1 to 4. On the x-axis I plot $V(\epsilon)/1.25$, which is the ratio of the variance of the stochastic part of the model to the variance of the systematic component of the model.^[Calculate the variance of the systematic component of the model as $\big( V(X) = 1 \big) + \big( V(D) = .25 \big) = 1.25$] As the variance of the stochastic component of the model exceeds twice the variance of the systematic component, the crosses zero heuristic is overconfident in both marginal effects plots and the binning estimator. Once again, the performance of regression coefficients, differences between bins, and the compare extremes heuristic are invariant to the relative size of the stochastic variance.

```{r simulation.error, echo=FALSE, fig.cap="Type-1 Error Rates for Four Different Heuristics\\label{fig:vary.error}"}
vary.error <- read.csv("vary.error")
vary.error.melt <- melt(vary.error, id.vars = c("n","ar2","var.rat"), 
                    measure.vars = c("false.plot", "false.bin", "false.plot2", "false.bin2", "false.p"))
p <- ggplot(data=vary.error.melt, aes(x=var.rat, y=value, group=variable, linetype=variable, 
                                      color=variable)) + 
  geom_line(size=.7) 
p + geom_hline(yintercept=0.05) + 
  scale_colour_manual("", values=plotcols, labels=c("Crosses Zero Heuristic (ME Plot)", 
                                           "Crosses Zero Heuristic (Bins)",
                                           "Compare Extremes Heuristic",
                                           "Difference between Bins",
                                           "Coefficient and p-value")) +
  scale_linetype_manual("", values=c(1,2,3,2,1),  labels=c("Crosses Zero Heuristic (ME Plot)", 
                                                  "Crosses Zero Heuristic (Bins)",
                                                  "Compare Extremes Heuristic",
                                                  "Difference between Bins",
                                                  "Coefficient and p-value")) +
  theme_classic() + labs(x = "Ratio of Stochastic to Systematic Variance", y="Probability of Type-1 Error") 
```

Next, I fix $n = 500, V(\epsilon) = 4, \beta_2 =1$, and $\alpha = 1$, and then vary $\beta_1$ from 0 to 2.5. As the true effect of $D$ becomes larger, for any given sample size and error variance, the likelihood that confidence intervals capture zero should decrease. Here, the x-axis is the ratio $b_1/b_2$. When the effect of $D$ on $Y$ is equal or less than 1.5 times the effect of $X$ on $Y$, the crosses zero heuristic is overconfident relative to regression coefficients about the presence of an interactive effect. Once again, the performance of regression coefficients, differences between bins, and the compare extremes heuristic are invariant to the relative size of $b_1$ and $b_2$.

```{r simulation.beta, echo=FALSE, fig.cap="Type-1 Error Rates for Four Different Heuristics\\label{fig:vary.b1b2}"}
vary.b1b2 <- read.csv("vary.b1b2")
vary.b1b2.melt <- melt(vary.b1b2, id.vars = c("n","ar2","beta1s"), 
                    measure.vars = c("false.plot", "false.bin", "false.plot2", "false.bin2", "false.p"))
p <- ggplot(data=vary.b1b2.melt, aes(x=beta1s, y=value, group=variable, linetype=variable, 
                                     color=variable)) + 
  geom_line(size=.7) 
p + geom_hline(yintercept=0.05) + 
  scale_colour_manual("", values=plotcols, labels=c("Crosses Zero Heuristic (ME Plot)", 
                                           "Crosses Zero Heuristic (Bins)",
                                           "Compare Extremes Heuristic",
                                           "Difference between Bins",
                                           "Coefficient and p-value")) +
  scale_linetype_manual("", values=c(1,2,3,2,1),  labels=c("Crosses Zero Heuristic (ME Plot)", 
                                                  "Crosses Zero Heuristic (Bins)",
                                                  "Compare Extremes Heuristic",
                                                  "Difference between Bins",
                                                  "Coefficient and p-value")) +
  theme_classic() + labs(x = "Beta1/Beta2", y="Probability of Type-1 Error") 
```

Finally, I fix $n = 500, V(\epsilon) = 4, \beta_1 = 1$, and $\beta_2 = 1$, and then vary $\alpha$ from 0 to 2.5. In this case, the performance of the five tests is invariant to the value of $\alpha$---and the crosses zero heuristic is always overconfident, either when applied to a marginal effects plot or the binning estimator.

```{r simulation.alpha, echo=FALSE, fig.cap="Type-1 Error Rates for Four Different Heuristics\\label{fig:vary.alpha}"}
vary.alpha <- read.csv("vary.alpha")
vary.alpha.melt <- melt(vary.alpha, id.vars = c("n","ar2","alphas"), 
                    measure.vars = c("false.plot", "false.bin", "false.plot2", "false.bin2", "false.p"))
p <- ggplot(data=vary.alpha.melt, aes(x=alphas, y=value, group=variable, linetype=variable, 
                                      color=variable)) + 
  geom_line(size=.7) 
p + geom_hline(yintercept=0.05) + 
  scale_colour_manual("", values=plotcols, labels=c("Crosses Zero Heuristic (ME Plot)", 
                                           "Crosses Zero Heuristic (Bins)",
                                           "Compare Extremes Heuristic",
                                           "Difference between Bins",
                                           "Coefficient and p-value")) +
  scale_linetype_manual("", values=c(1,2,3,2,1),  labels=c("Crosses Zero Heuristic (ME Plot)", 
                                                  "Crosses Zero Heuristic (Bins)",
                                                  "Compare Extremes Heuristic",
                                                  "Difference between Bins",
                                                  "Coefficient and p-value")) +
  theme_classic() + labs(x = "Alpha", y="Probability of Type-1 Error") 
```